Performance Analysis of Three Text-Join Algorithms

Authors

  • Weiyi Meng
  • Clement T. Yu
  • Wei Wang
  • Naphtali Rishe
Abstract

When a multidatabase system contains textual database systems (i.e., information retrieval systems), queries against the global schema of the multidatabase system may contain a new type of join: joins between attributes of textual type. Three algorithms for processing such joins are presented and their I/O costs are analyzed in this paper. Since such joins often involve very large document collections, it is very important to find efficient algorithms to process them. The three algorithms differ on whether the documents themselves or the inverted files on the documents are used to process the join. Our analysis and the simulation results indicate that the relative performance of these algorithms depends on the input document collections, system characteristics, and the input query. For each algorithm, the type of input document collections with which the algorithm is likely to perform well is identified. An integrated algorithm that automatically selects the best algorithm to use is also proposed.

Index Terms: Query processing, textual database, information retrieval, join algorithm, multidatabase.
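
The distinction the abstract draws, between processing a textual join from the documents themselves and processing it from inverted files, can be illustrated with a minimal Python sketch. It is not a reproduction of the paper's three algorithms and does not model I/O cost. The function names, the whitespace tokenizer, the term-frequency dot-product similarity, and the top-k cutoff are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict

def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def join_by_documents(c1, c2, k):
    """Compare every document in c2 directly with every document in c1."""
    result = {}
    for id2, text2 in c2.items():
        tf2 = Counter(tokenize(text2))
        scores = []
        for id1, text1 in c1.items():
            tf1 = Counter(tokenize(text1))
            # Assumed similarity: dot product of term frequencies.
            score = sum(tf1[t] * tf2[t] for t in tf2 if t in tf1)
            if score > 0:
                scores.append((id1, score))
        scores.sort(key=lambda s: (-s[1], s[0]))
        result[id2] = [id1 for id1, _ in scores[:k]]
    return result

def build_inverted_file(c):
    """Map each term to {doc_id: term frequency}."""
    inverted = defaultdict(dict)
    for doc_id, text in c.items():
        for term, tf in Counter(tokenize(text)).items():
            inverted[term][doc_id] = tf
    return inverted

def join_by_inverted_file(c1, c2, k):
    """Use an inverted file on c1 so only documents sharing a term are touched."""
    inverted1 = build_inverted_file(c1)
    result = {}
    for id2, text2 in c2.items():
        acc = defaultdict(int)  # accumulated similarity per c1 document
        for term, tf2 in Counter(tokenize(text2)).items():
            for id1, tf1 in inverted1.get(term, {}).items():
                acc[id1] += tf1 * tf2
        ranked = sorted(acc.items(), key=lambda s: (-s[1], s[0]))
        result[id2] = [id1 for id1, _ in ranked[:k]]
    return result
```

Both functions return, for each document in the second collection, the identifiers of the k most similar documents in the first collection. They produce the same answer and differ only in the data structure they scan, which is the kind of trade-off the abstract says depends on the collections, the system, and the query.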

Similar resources

gSSJoin: a GPU-based Set Similarity Join Algorithm

Set similarity join is a core operation for text data integration, cleaning, and mining. Previous research work on improving the performance of set similarity joins mostly focused on sequential, CPU-based algorithms. Main optimizations of such algorithms exploit high threshold values and the underlying data characteristics to derive efficient filters. In this paper, we investigate strategies to...

Effective Early Termination Techniques for Text Similarity Join Operator

Text similarity join operator joins two relations if their join attributes are textually similar to each other, and it has a variety of application domains including integration and querying of data from heterogeneous resources; cleansing of data; and mining of data. Although, the text similarity join operator is widely used, its processing is expensive due to the huge number of similarity comp...

Summary of Revision Title: Efficient Join-index-based Spatial-join Processing: a Clustering Approach

Comment: What are the complexities of the proposed algorithms (such as SC)? Response: We have added text summarizing an analysis of the computational complexity. This text appears in Section 2.2 (last paragraph) and Section 4.4 (last paragraph), and is reproduced in Appendix A of this document. We also added additional text and figure in the "Scope" (Section 1.3) to show the difference between spa...

Parallel Pointer-Based Join Algorithms in Memory-mapped Environments

Three pointer-based parallel join algorithms are presented and analyzed for environments in which secondary storage is made transparent to the programmer through memory mapping. Buhr, Goel, and Wai [11] have shown that data structures such as B-Trees, R-Trees and graph data structures can be implemented as efficiently and effectively in this environment as in a traditional environment using exp...

Parallel Hash-Based Join Algorithms for a Shared-Everything

We analyze the costs, and describe the implementation, of three hash-based join algorithms for a general-purpose shared-memory multiprocessor. The three algorithms considered are the Hashed Loops, GRACE and Hybrid algorithms. We also describe the results of a set of experiments which validate the cost models presented and demonstrate the relative performance of the three algorithms.

Journal title:
  • IEEE Trans. Knowl. Data Eng.

Volume 10  Issue 

Pages  -

Publication date 1998